Cleaning: Cycle Share

There are 3 datasets that provide data on the stations, trips, and weather from 2014-2016.

Station dataset

  • station_id: station ID number
  • name: name of station
  • lat: station latitude
  • long: station longitude
  • install_date: date that station was placed in service
  • install_dockcount: number of docks at each station on the installation date
  • modification_date: date that station was modified, resulting in a change in location or dock count
  • current_dockcount: number of docks at each station on 8/31/2016
  • decommission_date: date that station was placed out of service

Trip dataset

  • trip_id: numeric ID of bike trip taken
  • starttime: day and time trip started, in PST
  • stoptime: day and time trip ended, in PST
  • bikeid: ID attached to each bike
  • tripduration: time of trip in seconds
  • from_station_name: name of station where trip originated
  • to_station_name: name of station where trip terminated
  • from_station_id: ID of station where trip originated
  • to_station_id: ID of station where trip terminated
  • usertype: "Short-Term Pass Holder" is a rider who purchased a 24-Hour or 3-Day Pass; "Member" is a rider who purchased a Monthly or an Annual Membership
  • gender: gender of rider
  • birthyear: birth year of rider

Weather dataset contains daily weather information in the service area

1. Import all sets into a dictionary and correct any errors

The trip file had the headers repeated after the values of a line, I simply got rid of them cancelling the values from the file. I also noticed that the first several rows were repeated and the line with the headers was one of those, so I used the values from the original line to fill in the missing ones.


In [1]:
import pandas as pd
import numpy as np

sets = ['station', 'trip', 'weather']

cycle = {}

for s in sets:
    cycle[s] = pd.read_csv('cycle_share/' + s + '.csv')

In [2]:
cycle['trip'].head()


Out[2]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31 10/13/2014 10:48 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32 10/13/2014 10:48 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33 10/13/2014 10:48 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34 10/13/2014 10:48 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34 10/13/2014 10:49 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0

2. Print data summaries including the number of null values. Should we drop or try to correct any of the null values?


In [3]:
for df in cycle:
    print(df)
    print(cycle[df].describe(include='all'))
    print('\n')


station
       station_id                   name        lat        long install_date  \
count          58                     58  58.000000   58.000000           58   
unique         58                     58        NaN         NaN            9   
top         PS-04  9th Ave N & Mercer St        NaN         NaN   10/13/2014   
freq            1                      1        NaN         NaN           50   
mean          NaN                    NaN  47.624796 -122.327242          NaN   
std           NaN                    NaN   0.019066    0.014957          NaN   
min           NaN                    NaN  47.598488 -122.355230          NaN   
25%           NaN                    NaN  47.613239 -122.338735          NaN   
50%           NaN                    NaN  47.618591 -122.328207          NaN   
75%           NaN                    NaN  47.627712 -122.316691          NaN   
max           NaN                    NaN  47.666145 -122.284119          NaN   

        install_dockcount modification_date  current_dockcount  \
count           58.000000                17          58.000000   
unique                NaN                12                NaN   
top                   NaN         2/20/2015                NaN   
freq                  NaN                 4                NaN   
mean            17.586207               NaN          16.517241   
std              3.060985               NaN           5.117021   
min             12.000000               NaN           0.000000   
25%             16.000000               NaN          16.000000   
50%             18.000000               NaN          18.000000   
75%             18.000000               NaN          18.000000   
max             30.000000               NaN          26.000000   

       decommission_date  
count                  4  
unique                 4  
top           10/29/2015  
freq                   1  
mean                 NaN  
std                  NaN  
min                  NaN  
25%                  NaN  
50%                  NaN  
75%                  NaN  
max                  NaN  


trip
              trip_id         starttime          stoptime    bikeid  \
count   286858.000000            286858            286858    286858   
unique            NaN            176216            169285       493   
top               NaN  10/13/2014 11:51  10/13/2014 11:51  SEA00281   
freq              NaN                18                18       835   
mean    112431.781746               NaN               NaN       NaN   
std      76565.086482               NaN               NaN       NaN   
min        431.000000               NaN               NaN       NaN   
25%      43051.000000               NaN               NaN       NaN   
50%     103486.500000               NaN               NaN       NaN   
75%     179544.750000               NaN               NaN       NaN   
max     255245.000000               NaN               NaN       NaN   

         tripduration                from_station_name    to_station_name  \
count   286858.000000                           286858             286858   
unique            NaN                               61                 61   
top               NaN  Pier 69 / Alaskan Way & Clay St  2nd Ave & Pine St   
freq              NaN                            13054              13784   
mean      1178.354284                              NaN                NaN   
std       2038.697070                              NaN                NaN   
min         60.008000                              NaN                NaN   
25%        387.925750                              NaN                NaN   
50%        624.846500                              NaN                NaN   
75%       1118.483250                              NaN                NaN   
max      28794.398000                              NaN                NaN   

       from_station_id to_station_id usertype  gender      birthyear  
count           286858        286858   286858  181557  181553.000000  
unique              61            61        2       3            NaN  
top              WF-01        CBD-13   Member    Male            NaN  
freq             13054         13784   181557  140564            NaN  
mean               NaN           NaN      NaN     NaN    1979.759062  
std                NaN           NaN      NaN     NaN      10.167119  
min                NaN           NaN      NaN     NaN    1931.000000  
25%                NaN           NaN      NaN     NaN    1974.000000  
50%                NaN           NaN      NaN     NaN    1983.000000  
75%                NaN           NaN      NaN     NaN    1987.000000  
max                NaN           NaN      NaN     NaN    1999.000000  


weather
             Date  Max_Temperature_F  Mean_Temperature_F  Min_TemperatureF  \
count         689         689.000000          688.000000        689.000000   
unique        689                NaN                 NaN               NaN   
top     4/22/2016                NaN                 NaN               NaN   
freq            1                NaN                 NaN               NaN   
mean          NaN          64.027576           56.584302         49.454282   
std           NaN          12.427843           10.408058          9.451437   
min           NaN          39.000000           33.000000         23.000000   
25%           NaN          55.000000           48.000000         43.000000   
50%           NaN          63.000000           56.000000         50.000000   
75%           NaN          73.000000           65.000000         57.000000   
max           NaN          98.000000           83.000000         70.000000   

        Max_Dew_Point_F  MeanDew_Point_F  Min_Dewpoint_F  Max_Humidity  \
count        689.000000       689.000000      689.000000    689.000000   
unique              NaN              NaN             NaN           NaN   
top                 NaN              NaN             NaN           NaN   
freq                NaN              NaN             NaN           NaN   
mean          48.571843        45.021771       40.873730     84.541364   
std            7.501230         7.914025        8.854608      9.718948   
min           10.000000         4.000000        1.000000     40.000000   
25%           44.000000        41.000000       36.000000     78.000000   
50%           50.000000        46.000000       42.000000     86.000000   
75%           54.000000        51.000000       47.000000     90.000000   
max           77.000000        59.000000       57.000000    100.000000   

        Mean_Humidity  Min_Humidity   ...    Mean_Sea_Level_Pressure_In  \
count      689.000000    689.000000   ...                    689.000000   
unique            NaN           NaN   ...                           NaN   
top               NaN           NaN   ...                           NaN   
freq              NaN           NaN   ...                           NaN   
mean        68.506531     49.973875   ...                     30.034761   
std         12.701871     15.825701   ...                      0.197503   
min         24.000000     15.000000   ...                     29.310000   
25%         60.000000     38.000000   ...                     29.930000   
50%         70.000000     50.000000   ...                     30.040000   
75%         79.000000     63.000000   ...                     30.160000   
max         95.000000     87.000000   ...                     30.810000   

        Min_Sea_Level_Pressure_In  Max_Visibility_Miles  \
count                  689.000000            689.000000   
unique                        NaN                   NaN   
top                           NaN                   NaN   
freq                          NaN                   NaN   
mean                    29.940610              9.989840   
std                      0.221803              0.266679   
min                     29.140000              3.000000   
25%                     29.840000             10.000000   
50%                     29.960000             10.000000   
75%                     30.080000             10.000000   
max                     30.750000             10.000000   

        Mean_Visibility_Miles  Min_Visibility_Miles  Max_Wind_Speed_MPH  \
count              689.000000            689.000000          689.000000   
unique                    NaN                   NaN                 NaN   
top                       NaN                   NaN                 NaN   
freq                      NaN                   NaN                 NaN   
mean                 9.429608              7.245283           11.085631   
std                  1.174360              3.281278            3.921087   
min                  1.000000              0.000000            4.000000   
25%                  9.000000              4.000000            8.000000   
50%                 10.000000              9.000000           10.000000   
75%                 10.000000             10.000000           13.000000   
max                 10.000000             10.000000           30.000000   

        Mean_Wind_Speed_MPH  Max_Gust_Speed_MPH Precipitation_In  Events  
count             689.00000                 504       689.000000     328  
unique                  NaN                  25              NaN       9  
top                     NaN                   -              NaN    Rain  
freq                    NaN                 225              NaN     287  
mean                4.63135                 NaN         0.105065     NaN  
std                 2.78032                 NaN         0.235644     NaN  
min                 0.00000                 NaN         0.000000     NaN  
25%                 3.00000                 NaN         0.000000     NaN  
50%                 4.00000                 NaN         0.000000     NaN  
75%                 6.00000                 NaN         0.090000     NaN  
max                23.00000                 NaN         2.200000     NaN  

[11 rows x 21 columns]


Gender and year of birth have nulls, I don't think we should drop them because we would lose over 100000 rows; instead we could use the median or mean to replace nulls for the year of birth. Regarding the gender it's not possible to make any replacement, but it should be noted that most of the entries are male.


In [5]:
cycle['trip'].groupby('gender')['trip_id'].count()


Out[5]:
gender
Female     37562
Male      140564
Other       3431
Name: trip_id, dtype: int64

3. Create a column in the trip table that contains only the date (no time)


In [39]:
#cycle['trip']['date'] = cycle['trip']['starttime'].apply(lambda x: pd.to_datetime(x[0:x.find(' ')], format='%m/%d/%Y'))
cycle['trip']['date'] = cycle['trip']['starttime'].apply(lambda x: x[0:x.find(' ')])
cycle['trip'].head()


Out[39]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear date
0 431 10/13/2014 10:31 10/13/2014 10:48 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0 10/13/2014
1 432 10/13/2014 10:32 10/13/2014 10:48 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0 10/13/2014
2 433 10/13/2014 10:33 10/13/2014 10:48 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0 10/13/2014
3 434 10/13/2014 10:34 10/13/2014 10:48 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0 10/13/2014
4 435 10/13/2014 10:34 10/13/2014 10:49 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0 10/13/2014

4. Merge weather data with trip data and be sure not to lose any trip data


In [40]:
trip_weather = pd.merge(cycle['trip'], cycle['weather'], left_on='date', right_on='Date', how='left')
trip_weather.head()


Out[40]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype ... Mean_Sea_Level_Pressure_In Min_Sea_Level_Pressure_In Max_Visibility_Miles Mean_Visibility_Miles Min_Visibility_Miles Max_Wind_Speed_MPH Mean_Wind_Speed_MPH Max_Gust_Speed_MPH Precipitation_In Events
0 431 10/13/2014 10:31 10/13/2014 10:48 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 29.79 29.65 10 10 4 13 4 21 0.0 Rain
1 432 10/13/2014 10:32 10/13/2014 10:48 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 29.79 29.65 10 10 4 13 4 21 0.0 Rain
2 433 10/13/2014 10:33 10/13/2014 10:48 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 29.79 29.65 10 10 4 13 4 21 0.0 Rain
3 434 10/13/2014 10:34 10/13/2014 10:48 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 29.79 29.65 10 10 4 13 4 21 0.0 Rain
4 435 10/13/2014 10:34 10/13/2014 10:49 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 29.79 29.65 10 10 4 13 4 21 0.0 Rain

5 rows × 34 columns

5. Drop records that are completely duplicated (all values). Check for and inspect any duplicate trip_id values that remain. Remove if they exist.


In [68]:
print(len(trip_weather))
trip_weather = trip_weather.drop_duplicates()
print(len(trip_weather))


286858
236065

In [72]:
print(len(trip_weather['trip_id']))
print(len(trip_weather['trip_id'].unique()))


236065
236065

6. Create columns for lat & long values for the from- and to- stations


In [80]:
trip_weather = pd.merge(trip_weather, cycle['station'][['station_id', 'lat', 'long']], left_on='from_station_id', right_on='station_id', how='left').drop('station_id', axis=1)
trip_weather = pd.merge(trip_weather, cycle['station'][['station_id', 'lat', 'long']], left_on='to_station_id', right_on='station_id', how='left', suffixes=['_from_station', '_to_station']).drop('station_id', axis=1)
trip_weather.head()


Out[80]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype ... Min_Visibility_Miles Max_Wind_Speed_MPH Mean_Wind_Speed_MPH Max_Gust_Speed_MPH Precipitation_In Events lat_from_station long_from_station lat_to_station long_to_station
0 431 10/13/2014 10:31 10/13/2014 10:48 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 4 13 4 21 0.0 Rain 47.60595 -122.335768 47.600757 -122.332946
1 432 10/13/2014 10:32 10/13/2014 10:48 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 4 13 4 21 0.0 Rain 47.60595 -122.335768 47.600757 -122.332946
2 433 10/13/2014 10:33 10/13/2014 10:48 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 4 13 4 21 0.0 Rain 47.60595 -122.335768 47.600757 -122.332946
3 434 10/13/2014 10:34 10/13/2014 10:48 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 4 13 4 21 0.0 Rain 47.60595 -122.335768 47.600757 -122.332946
4 435 10/13/2014 10:34 10/13/2014 10:49 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member ... 4 13 4 21 0.0 Rain 47.60595 -122.335768 47.600757 -122.332946

5 rows × 38 columns

7. Write a function to round all tripduration values to the nearest half second increment and then round all the values in the data


In [129]:
def round_trips(duration):
    roundings = np.array([np.floor(duration), np.floor(duration)+0.5, np.ceil(duration)])
    return roundings[np.argmin(np.abs(duration - roundings))]

trip_weather['tripduration'] = trip_weather['tripduration'].apply(round_trips)

In [130]:
trip_weather['tripduration'].head(10)


Out[130]:
0    986.0
1    926.5
2    884.0
3    866.0
4    924.0
5    809.0
6    596.5
7    592.0
8    586.5
9    587.5
Name: tripduration, dtype: float64

8. Verify that trip_duration matches the timestamps to within 60 seconds


In [140]:
trip_weather[np.abs(((pd.to_datetime(trip_weather['stoptime']) - pd.to_datetime(trip_weather['starttime'])) / np.timedelta64(1, 's')) - trip_weather['tripduration']) > 60]


Out[140]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype ... Min_Visibility_Miles Max_Wind_Speed_MPH Mean_Wind_Speed_MPH Max_Gust_Speed_MPH Precipitation_In Events lat_from_station long_from_station lat_to_station long_to_station
7040 8660 11/2/2014 1:29 11/2/2014 1:12 SEA00384 2571.0 Pine St & 9th Ave Westlake Ave & 6th Ave SLU-16 SLU-15 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.613715 -122.331777 47.613628 -122.337341
7041 8661 11/2/2014 1:29 11/2/2014 1:35 SEA00371 3986.0 Cal Anderson Park / 11th Ave & Pine St Cal Anderson Park / 11th Ave & Pine St CH-08 CH-08 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.615486 -122.318245 47.615486 -122.318245
7042 8662 11/2/2014 1:29 11/2/2014 1:35 SEA00170 3978.5 Cal Anderson Park / 11th Ave & Pine St Cal Anderson Park / 11th Ave & Pine St CH-08 CH-08 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.615486 -122.318245 47.615486 -122.318245
7043 8663 11/2/2014 1:29 11/2/2014 1:11 SEA00205 2513.5 Pine St & 9th Ave Westlake Ave & 6th Ave SLU-16 SLU-15 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.613715 -122.331777 47.613628 -122.337341
7044 8666 11/2/2014 1:31 11/2/2014 1:11 SEA00430 2398.5 Pine St & 9th Ave Westlake Ave & 6th Ave SLU-16 SLU-15 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.613715 -122.331777 47.613628 -122.337341
7045 8667 11/2/2014 1:37 11/2/2014 1:12 SEA00112 2074.0 Pine St & 9th Ave Westlake Ave & 6th Ave SLU-16 SLU-15 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.613715 -122.331777 47.613628 -122.337341
7046 8669 11/2/2014 1:44 11/2/2014 1:04 SEA00247 1201.5 2nd Ave & Vine St Key Arena / 1st Ave N & Harrison St BT-03 SLU-19 Member ... 4 9 5 - 0.11 Rain 47.615829 -122.348564 47.622277 -122.355230
7047 8670 11/2/2014 1:52 11/2/2014 1:07 SEA00460 919.0 2nd Ave & Vine St Summit Ave & E Denny Way BT-03 CH-01 Member ... 4 9 5 - 0.11 Rain 47.615829 -122.348564 47.618633 -122.325249
7048 8672 11/2/2014 1:59 11/2/2014 1:25 SEA00481 1565.0 UW Magnuson Health Sciences Center Rotunda / C... Children's Hospital / Sandpoint Way NE & 40th ... UW-10 DPD-03 Short-Term Pass Holder ... 4 9 5 - 0.11 Rain 47.650725 -122.311188 47.663509 -122.284119

9 rows × 38 columns

9.Something is wrong with the Max_Gust_Speed_MPH column. Identify and correct the problem, then save the data.


In [154]:
# not an int, let's convert it
trip_weather['Max_Gust_Speed_MPH'] = trip_weather['Max_Gust_Speed_MPH'].replace('-', np.NaN).astype('float')

In [155]:
trip_weather['Max_Gust_Speed_MPH'].describe()


Out[155]:
count    88509.000000
mean        21.287474
std          5.169371
min         16.000000
25%         18.000000
50%         20.000000
75%         24.000000
max         52.000000
Name: Max_Gust_Speed_MPH, dtype: float64

In [156]:
trip_weather.to_csv('cycle_share/trip_weather.csv')

Cleaning: Movies

This data set contains 28 attributes related to various movie titles that have been scraped from IMDb. The set is supposed to contain unique titles for each record, where each record has the following attributes:

"movie_title" "color" "num_critic_for_reviews" "movie_facebook_likes" "duration" "director_name" "director_facebook_likes" "actor_3_name" "actor_3_facebook_likes" "actor_2_name" "actor_2_facebook_likes" "actor_1_name" "actor_1_facebook_likes" "gross" "genres" "num_voted_users" "cast_total_facebook_likes" "facenumber_in_poster" "plot_keywords" "movie_imdb_link" "num_user_for_reviews" "language" "country" "content_rating" "budget" "title_year" "imdb_score" "aspect_ratio"

The original set is available kaggle (here)

1. Check for and correct similar values in color, language, and country


In [157]:
movies = pd.read_csv('movies/movies_data.csv')
movies.head()


Out[157]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns


In [224]:
movies.dtypes


Out[224]:
color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
movie_facebook_likes           int64
dtype: object

In [159]:
movies.describe(include='all')


Out[159]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 5024 4939 4993.000000 5028.000000 4939.000000 5020.000000 5030 5036.000000 4.159000e+03 5043 ... 5022.000000 5031 5038 4740 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
unique 4 2411 NaN NaN NaN NaN 3040 NaN NaN 914 ... NaN 47 65 18 NaN NaN NaN NaN NaN NaN
top Color Steven Spielberg NaN NaN NaN NaN Morgan Freeman NaN NaN Drama ... NaN English USA R NaN NaN NaN NaN NaN NaN
freq 4799 25 NaN NaN NaN NaN 20 NaN NaN 236 ... NaN 4704 3807 2118 NaN NaN NaN NaN NaN NaN
mean NaN NaN 140.194272 107.201074 686.509212 645.009761 NaN 6560.047061 4.846841e+07 NaN ... 272.770808 NaN NaN NaN 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std NaN NaN 121.601675 25.197441 2813.328607 1665.041728 NaN 15020.759120 6.845299e+07 NaN ... 377.982886 NaN NaN NaN 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min NaN NaN 1.000000 7.000000 0.000000 0.000000 NaN 0.000000 1.620000e+02 NaN ... 1.000000 NaN NaN NaN 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% NaN NaN 50.000000 93.000000 7.000000 133.000000 NaN 614.000000 5.340988e+06 NaN ... 65.000000 NaN NaN NaN 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% NaN NaN 110.000000 103.000000 49.000000 371.500000 NaN 988.000000 2.551750e+07 NaN ... 156.000000 NaN NaN NaN 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% NaN NaN 195.000000 118.000000 194.500000 636.000000 NaN 11000.000000 6.230944e+07 NaN ... 326.000000 NaN NaN NaN 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max NaN NaN 813.000000 511.000000 23000.000000 23000.000000 NaN 640000.000000 7.605058e+08 NaN ... 5060.000000 NaN NaN NaN 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000

11 rows × 28 columns


In [180]:
print(movies['color'].unique())
movies['color'] = movies['color'].apply(lambda x: 'Color' if x == 'color' else 'Black and White' if x == 'black and white' else x)
print(movies['color'].unique())


['Color' nan 'Black and White']
['Color' nan 'Black and White']

2. Create a function that detects and lists non-numeric columns containing values with leading or trailing whitespace. Remove the whitespace in these columns.


In [210]:
def find_spaces(df):
    cols = []
    for index, value in df.dtypes[df.dtypes == 'object'].iteritems():
        if df[index].str.startswith(' ').any() | df[index].str.endswith(' ').any():
            cols.append(index)
    
    return cols

find_spaces(movies)


Out[210]:
['director_name', 'actor_2_name', 'movie_title']

In [217]:
for col in find_spaces(movies):
    movies[col] = movies[col].str.lstrip().str.rstrip()

find_spaces(movies)


Out[217]:
[]

3. Remove duplicate records. Inspect any remaining duplicate movie titles.


In [221]:
print(len(movies))
movies = movies.drop_duplicates()
print(len(movies))


4998
4998

In [252]:
title_duplicates = list(movies['movie_title'].value_counts()[movies['movie_title'].value_counts() > 1].index)

movies[movies['movie_title'].isin(title_duplicates)].sort_values(by='movie_title')


Out[252]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
4894 Color Richard Fleischer 69.0 127.0 130.0 51.0 Robert J. Wilke 618.0 NaN Adventure|Drama|Family|Fantasy|Sci-Fi ... 108.0 English USA Approved 5000000.0 1954.0 53.0 7.2 1.37 0
3711 Color Richard Fleischer 69.0 127.0 130.0 51.0 Robert J. Wilke 617.0 NaN Adventure|Drama|Family|Fantasy|Sci-Fi ... 108.0 English USA Approved 5000000.0 1954.0 53.0 7.2 1.37 0
1420 Color Wes Craven 256.0 101.0 0.0 574.0 Lin Shaye 40000.0 26505000.0 Horror ... 668.0 English USA X 1800000.0 1984.0 852.0 7.5 1.85 10000
4352 Color Wes Craven 256.0 101.0 0.0 574.0 Lin Shaye 40000.0 26505000.0 Horror ... 668.0 English USA X 1800000.0 1984.0 852.0 7.5 1.85 10000
1113 Color Julie Taymor 156.0 133.0 278.0 107.0 T.V. Carpio 5000.0 24343673.0 Drama|Fantasy|Musical|Romance ... 524.0 English USA PG-13 45000000.0 2007.0 117.0 7.4 2.35 14000
4842 Color Julie Taymor 156.0 133.0 278.0 107.0 T.V. Carpio 5000.0 24343673.0 Drama|Fantasy|Musical|Romance ... 524.0 English USA PG-13 45000000.0 2007.0 117.0 7.4 2.35 14000
4128 Color Tim Burton 451.0 108.0 13000.0 11000.0 Alan Rickman 40000.0 334185206.0 Adventure|Family|Fantasy ... 736.0 English USA PG 200000000.0 2010.0 25000.0 6.5 1.85 24000
33 Color Tim Burton 451.0 108.0 13000.0 11000.0 Alan Rickman 40000.0 334185206.0 Adventure|Family|Fantasy ... 736.0 English USA PG 200000000.0 2010.0 25000.0 6.5 1.85 24000
1389 Color Cameron Crowe 138.0 105.0 488.0 13000.0 Bradley Cooper 15000.0 20991497.0 Comedy|Drama|Romance ... 172.0 English USA PG-13 37000000.0 2015.0 14000.0 5.4 1.85 11000
2639 Color Cameron Crowe 138.0 105.0 488.0 13000.0 Bradley Cooper 15000.0 20991497.0 Comedy|Drama|Romance ... 172.0 English USA PG-13 37000000.0 2015.0 14000.0 5.4 1.85 11000
3587 Color Frank Coraci 188.0 120.0 153.0 447.0 Steve Coogan 1000.0 24004159.0 Action|Adventure|Comedy ... 191.0 English USA PG 110000000.0 2004.0 1000.0 5.8 2.35 0
271 Color Frank Coraci 188.0 120.0 153.0 447.0 Steve Coogan 1000.0 24004159.0 Action|Adventure|Comedy ... 191.0 English USA PG 110000000.0 2004.0 1000.0 5.8 2.35 0
2613 Color Timur Bekmambetov 1.0 141.0 335.0 635.0 Ayelet Zurer 11000.0 NaN Adventure|Drama|History ... 1.0 English USA PG-13 100000000.0 2016.0 744.0 6.1 2.35 0
3967 Color Timur Bekmambetov 1.0 141.0 335.0 635.0 Ayelet Zurer 11000.0 NaN Adventure|Drama|History ... 1.0 English USA PG-13 100000000.0 2016.0 744.0 6.0 2.35 0
367 Color Timur Bekmambetov 1.0 141.0 335.0 635.0 Ayelet Zurer 11000.0 NaN Adventure|Drama|History ... 1.0 English USA PG-13 NaN 2016.0 745.0 6.1 2.35 0
2882 Color Jim Sheridan 217.0 105.0 260.0 1000.0 Jake Gyllenhaal 20000.0 28501651.0 Drama|Thriller ... 201.0 English USA R 26000000.0 2009.0 15000.0 7.1 2.35 10000
1852 Color Jim Sheridan 217.0 105.0 260.0 3000.0 Jake Gyllenhaal 20000.0 28501651.0 Drama|Thriller ... 201.0 English USA R 26000000.0 2009.0 15000.0 7.1 2.35 10000
4350 Color Kimberly Peirce 359.0 100.0 108.0 534.0 Judy Greer 17000.0 35266619.0 Drama|Fantasy|Horror ... 371.0 English USA R 30000000.0 2013.0 2000.0 5.9 2.35 39000
1662 Color Kimberly Peirce 359.0 100.0 108.0 534.0 Judy Greer 17000.0 35266619.0 Drama|Fantasy|Horror ... 371.0 English USA R 30000000.0 2013.0 2000.0 5.9 2.35 39000
286 Black and White Martin Campbell 400.0 144.0 258.0 834.0 Tobias Menzies 6000.0 167007184.0 Action|Adventure|Thriller ... 2301.0 English UK PG-13 150000000.0 2006.0 1000.0 8.0 2.35 0
2944 Black and White Martin Campbell 400.0 144.0 258.0 834.0 Tobias Menzies 6000.0 167007184.0 Action|Adventure|Thriller ... 2301.0 English UK PG-13 150000000.0 2006.0 1000.0 8.0 2.35 0
2063 Color Andy Cadiff 65.0 101.0 3.0 120.0 Stark Sands 448.0 12189514.0 Comedy|Romance ... 129.0 English USA PG-13 23000000.0 2004.0 143.0 6.1 2.35 0
1705 Color Andy Cadiff 65.0 101.0 3.0 120.0 Stark Sands 448.0 12189514.0 Comedy|Romance ... 129.0 English USA PG-13 23000000.0 2004.0 143.0 6.1 2.35 0
336 Color Kenneth Branagh 343.0 105.0 0.0 502.0 Derek Jacobi 2000.0 201148159.0 Drama|Family|Fantasy|Romance ... 322.0 English USA PG 95000000.0 2015.0 520.0 7.0 2.35 56000
4141 Color Kenneth Branagh 343.0 105.0 0.0 502.0 Derek Jacobi 2000.0 201148159.0 Drama|Family|Fantasy|Romance ... 323.0 English USA PG 95000000.0 2015.0 520.0 7.0 2.35 56000
213 Color Louis Leterrier 344.0 106.0 255.0 850.0 Jason Flemyng 14000.0 163192114.0 Action|Adventure|Fantasy ... 637.0 English USA PG-13 125000000.0 2010.0 1000.0 5.8 2.35 15000
2650 Color Louis Leterrier 344.0 106.0 255.0 850.0 Jason Flemyng 14000.0 163192114.0 Action|Adventure|Fantasy ... 637.0 English USA PG-13 125000000.0 2010.0 1000.0 5.8 2.35 15000
2193 Color John Milius 166.0 129.0 468.0 183.0 Mako 919.0 37567440.0 Adventure|Fantasy ... 337.0 English USA R 20000000.0 1982.0 691.0 6.9 2.35 0
390 Color John Milius 166.0 129.0 468.0 183.0 Mako 919.0 37567440.0 Adventure|Fantasy ... 337.0 English USA R 20000000.0 1982.0 691.0 6.9 2.35 0
430 Color George A. Romero 127.0 130.0 0.0 602.0 Hal Holbrook 875.0 NaN Comedy|Fantasy|Horror ... 211.0 English USA R 8000000.0 1982.0 826.0 6.9 1.85 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2105 Color Michael Bay 257.0 136.0 0.0 3000.0 Steve Buscemi 19000.0 35799026.0 Action|Adventure|Romance|Sci-Fi|Thriller ... 899.0 English USA PG-13 126000000.0 2005.0 12000.0 6.9 2.35 0
280 Color Michael Bay 257.0 136.0 0.0 3000.0 Steve Buscemi 19000.0 35799026.0 Action|Adventure|Romance|Sci-Fi|Thriller ... 899.0 English USA PG-13 126000000.0 2005.0 12000.0 6.9 2.35 0
1805 Color Jon Favreau 370.0 106.0 4000.0 591.0 Bill Murray 19000.0 362645141.0 Adventure|Drama|Family|Fantasy ... 398.0 English UK PG 175000000.0 2016.0 13000.0 7.8 1.85 65000
79 Color Jon Favreau 370.0 106.0 4000.0 591.0 Bill Murray 19000.0 362645141.0 Adventure|Drama|Family|Fantasy ... 398.0 English UK PG 175000000.0 2016.0 13000.0 7.8 1.85 65000
1184 Color John G. Avildsen 81.0 126.0 80.0 225.0 William Zabka 668.0 90800000.0 Action|Drama|Family|Sport ... 235.0 English USA PG 8000000.0 1984.0 641.0 7.2 1.85 0
3351 Color John G. Avildsen 81.0 126.0 80.0 225.0 William Zabka 668.0 90800000.0 Action|Drama|Family|Sport ... 235.0 English USA PG 8000000.0 1984.0 641.0 7.2 1.85 0
2647 Color Dennis Iliadis 241.0 114.0 29.0 616.0 Monica Potter 956.0 32721635.0 Crime|Horror|Thriller ... 279.0 English USA R 15000000.0 2009.0 878.0 6.6 1.85 0
4971 Color Dennis Iliadis 241.0 114.0 29.0 616.0 Monica Potter 956.0 32721635.0 Crime|Horror|Thriller ... 279.0 English USA R 15000000.0 2009.0 878.0 6.6 1.85 0
337 Color Peter Jackson 308.0 135.0 0.0 310.0 AJ Michalka 873.0 43982842.0 Drama|Fantasy|Thriller ... 593.0 English USA PG-13 65000000.0 2009.0 560.0 6.7 2.35 16000
4584 Color Peter Jackson 308.0 135.0 0.0 310.0 AJ Michalka 873.0 43982842.0 Drama|Fantasy|Thriller ... 593.0 English USA PG-13 65000000.0 2009.0 559.0 6.7 2.35 16000
84 Color Roland Joffé 10.0 109.0 596.0 283.0 Alice Englert 622.0 NaN Action|Adventure|Romance|Sci-Fi ... 15.0 English Belgium R NaN 2015.0 525.0 4.5 NaN 677
1508 Color Roland Joffé 10.0 109.0 596.0 282.0 Alice Englert 621.0 NaN Action|Adventure|Romance|Sci-Fi ... 15.0 English Belgium R NaN 2015.0 525.0 4.5 NaN 677
4150 Color Richard Donner 154.0 107.0 503.0 108.0 Patrick Troughton 264.0 NaN Fantasy|Horror|Mystery ... 310.0 English UK R 2800000.0 1976.0 139.0 7.6 2.35 7000
1894 Color Richard Donner 154.0 107.0 503.0 108.0 Patrick Troughton 264.0 NaN Fantasy|Horror|Mystery ... 310.0 English UK R 2800000.0 1976.0 139.0 7.6 2.35 7000
3578 Color Dan O'Bannon 203.0 108.0 228.0 358.0 Clu Gulager 431.0 NaN Comedy|Horror|Sci-Fi ... 300.0 English USA R 4000000.0 1985.0 426.0 7.3 1.85 0
3896 Color Dan O'Bannon 203.0 108.0 228.0 358.0 Clu Gulager 431.0 NaN Comedy|Horror|Sci-Fi ... 300.0 English USA R 4000000.0 1985.0 426.0 7.3 1.85 0
4936 Color Tobe Hooper 277.0 88.0 365.0 177.0 Edwin Neal 383.0 30859000.0 Horror|Thriller ... 826.0 English USA R 83532.0 1974.0 371.0 7.5 1.85 0
3278 Color Tobe Hooper 277.0 88.0 365.0 177.0 Edwin Neal 383.0 30859000.0 Horror|Thriller ... 826.0 English USA R 83532.0 1974.0 371.0 7.5 1.85 0
3170 Color Florian Henckel von Donnersmarck 321.0 103.0 207.0 3000.0 Angelina Jolie Pitt 40000.0 67631157.0 Action|Romance|Thriller ... 374.0 English USA PG-13 100000000.0 2010.0 11000.0 6.0 2.35 25000
305 Color Florian Henckel von Donnersmarck 321.0 103.0 207.0 3000.0 Angelina Jolie Pitt 40000.0 67631157.0 Action|Romance|Thriller ... 374.0 English USA PG-13 100000000.0 2010.0 11000.0 6.0 2.35 25000
4681 Color David S. Goyer 178.0 89.0 687.0 389.0 Atticus Shaffer 10000.0 42638165.0 Drama|Fantasy|Horror|Mystery|Thriller ... 230.0 English USA PG-13 16000000.0 2009.0 787.0 4.8 2.35 0
2546 Color David S. Goyer 178.0 89.0 687.0 389.0 Atticus Shaffer 10000.0 42638165.0 Drama|Fantasy|Horror|Mystery|Thriller ... 230.0 English USA PG-13 16000000.0 2009.0 787.0 4.8 2.35 0
4009 Color Akiva Schaffer 265.0 102.0 82.0 536.0 Nicholas Braun 622.0 34350553.0 Action|Comedy|Sci-Fi ... 207.0 English USA R 68000000.0 2012.0 591.0 5.7 2.35 15000
871 Color Akiva Schaffer 265.0 102.0 82.0 537.0 Nicholas Braun 622.0 34350553.0 Action|Comedy|Sci-Fi ... 207.0 English USA R 68000000.0 2012.0 591.0 5.7 2.35 15000
1378 Color Catherine Hardwicke 350.0 122.0 308.0 10000.0 Taylor Lautner 17000.0 191449475.0 Drama|Fantasy|Romance ... 1535.0 English USA PG-13 37000000.0 2008.0 12000.0 5.2 2.35 19000
2262 Color Catherine Hardwicke 350.0 122.0 308.0 10000.0 Taylor Lautner 17000.0 191449475.0 Drama|Fantasy|Romance ... 1535.0 English USA PG-13 37000000.0 2008.0 12000.0 5.2 2.35 19000
3981 Color Jaume Collet-Serra 349.0 113.0 174.0 767.0 Frank Langella 14000.0 61094903.0 Action|Mystery|Thriller ... 332.0 English UK PG-13 30000000.0 2011.0 902.0 6.9 2.35 29000
1232 Color Jaume Collet-Serra 349.0 113.0 174.0 767.0 Frank Langella 14000.0 61094903.0 Action|Mystery|Thriller ... 332.0 English UK PG-13 30000000.0 2011.0 903.0 6.9 2.35 29000
2099 Color Paul McGuigan 159.0 110.0 118.0 287.0 Spencer Wilding 11000.0 5773519.0 Drama|Horror|Sci-Fi|Thriller ... 91.0 English USA PG-13 40000000.0 2015.0 1000.0 6.0 2.35 11000
1146 Color Paul McGuigan 159.0 110.0 118.0 287.0 Spencer Wilding 11000.0 5773519.0 Drama|Horror|Sci-Fi|Thriller ... 91.0 English USA PG-13 40000000.0 2015.0 1000.0 6.0 2.35 11000

161 rows × 28 columns


In [256]:
print(movies.loc[337])
print(movies.loc[4584])


color                                                                    Color
director_name                                                    Peter Jackson
num_critic_for_reviews                                                     308
duration                                                                   135
director_facebook_likes                                                      0
actor_3_facebook_likes                                                     310
actor_2_name                                                       AJ Michalka
actor_1_facebook_likes                                                     873
gross                                                              4.39828e+07
genres                                                  Drama|Fantasy|Thriller
actor_1_name                                                 Michael Imperioli
movie_title                                                   The Lovely Bones
num_voted_users                                                         125109
cast_total_facebook_likes                                                 2370
actor_3_name                                                      Tom McCarthy
facenumber_in_poster                                                         1
plot_keywords                            1970s|afterlife|heaven|pedophile|rape
movie_imdb_link              http://www.imdb.com/title/tt0380510/?ref_=fn_t...
num_user_for_reviews                                                       593
language                                                               English
country                                                                    USA
content_rating                                                           PG-13
budget                                                                 6.5e+07
title_year                                                                2009
actor_2_facebook_likes                                                     560
imdb_score                                                                 6.7
aspect_ratio                                                              2.35
movie_facebook_likes                                                     16000
Name: 337, dtype: object
color                                                                    Color
director_name                                                    Peter Jackson
num_critic_for_reviews                                                     308
duration                                                                   135
director_facebook_likes                                                      0
actor_3_facebook_likes                                                     310
actor_2_name                                                       AJ Michalka
actor_1_facebook_likes                                                     873
gross                                                              4.39828e+07
genres                                                  Drama|Fantasy|Thriller
actor_1_name                                                 Michael Imperioli
movie_title                                                   The Lovely Bones
num_voted_users                                                         125114
cast_total_facebook_likes                                                 2369
actor_3_name                                                      Tom McCarthy
facenumber_in_poster                                                         1
plot_keywords                            1970s|afterlife|heaven|pedophile|rape
movie_imdb_link              http://www.imdb.com/title/tt0380510/?ref_=fn_t...
num_user_for_reviews                                                       593
language                                                               English
country                                                                    USA
content_rating                                                           PG-13
budget                                                                 6.5e+07
title_year                                                                2009
actor_2_facebook_likes                                                     559
imdb_score                                                                 6.7
aspect_ratio                                                              2.35
movie_facebook_likes                                                     16000
Name: 4584, dtype: object

4. Create a function that returns two arrays: one for titles that are truly duplicated, and one for duplicated titles are not the same movie.

  • hint: do this by comparing the imdb link values

In [269]:
true_dup = []
false_dup = []

for title in title_duplicates:
    for index, value in movies[movies['movie_title'] == title]['movie_imdb_link'].value_counts().iteritems():
        if value > 1:
            true_dup.append(title)
        else:
            false_dup.append(title)
            break

print(true_dup)
print(false_dup)


['Home', 'Ben-Hur', 'King Kong', 'The Watch', 'Clash of the Titans', 'A Nightmare on Elm Street', '20,000 Leagues Under the Sea', 'Poltergeist', 'The Gambler', 'Planet of the Apes', 'Ghostbusters', 'Spider-Man 3', 'Jack Reacher', 'The Omen', 'Day of the Dead', 'Dredd', 'Murder by Numbers', 'Juno', 'Precious', 'Halloween', 'Skyfall', 'The Fog', 'The Gift', 'The Fast and the Furious', 'Cinderella', 'Glory', 'Creepshow', 'Lolita', "The Astronaut's Wife", 'Casino Royale', 'Mercury Rising', 'The Great Gatsby', 'Syriana', 'Disturbia', 'Unknown', 'Brothers', 'The Jungle Book', 'The Tourist', 'Point Break', 'The Island', 'Pan', 'Aloha', 'The Lovely Bones', 'RoboCop', 'First Blood', 'Twilight', 'Conan the Barbarian', 'The Karate Kid', 'Around the World in 80 Days', 'Goosebumps', 'The Return of the Living Dead', 'Snitch', 'Dodgeball: A True Underdog Story', 'Exodus: Gods and Kings', 'The Day the Earth Stood Still', 'Oz the Great and Powerful', 'The Last House on the Left', 'Dawn of the Dead', 'Sabotage', 'The Unborn', 'Lucky Number Slevin', 'Carrie', 'House of Wax', 'Chasing Liberty', 'The Lovers', 'Heist', 'Teenage Mutant Ninja Turtles', 'The Texas Chain Saw Massacre', 'Alice in Wonderland', 'TRON: Legacy', 'Dekalog', 'Snakes on a Plane', 'Side Effects', 'Victor Frankenstein', 'Across the Universe', 'Eddie the Eagle']
['The Dead Zone', 'Out of the Blue', 'The Host']

5. Alter the names of duplicate titles that are different movies so each is unique. Then drop all duplicate rows based on movie title.


In [299]:
movies['movie_title'] = movies.apply(lambda x: x['movie_title'] + ' (' + str(int(x['title_year'])) + ')' if str(x['title_year']) != 'nan' and x['movie_title'] in false_dup else x['movie_title'], axis=1)

In [303]:
print(len(movies))
movies = movies.drop_duplicates('movie_title')
print(len(movies))


4998
4919

6. Create a series that ranks actors by proportion of movies they have appeared in


In [322]:
actors = movies.groupby(['actor_1_name'])['movie_title'].count()
actors = actors.add(movies.groupby(['actor_2_name'])['movie_title'].count(), fill_value=0)
actors = actors.add(movies.groupby(['actor_3_name'])['movie_title'].count(), fill_value=0)

(actors / len(movies)).sort_values(ascending=False).head(20)


Out[322]:
Robert De Niro        0.010775
Morgan Freeman        0.008742
Bruce Willis          0.007725
Matt Damon            0.007522
Johnny Depp           0.007319
Steve Buscemi         0.007319
Nicolas Cage          0.006709
Brad Pitt             0.006709
Bill Murray           0.006505
Will Ferrell          0.006505
Liam Neeson           0.006505
Denzel Washington     0.006302
Anthony Hopkins       0.006099
Jim Broadbent         0.005896
J.K. Simmons          0.005896
Harrison Ford         0.005896
Robert Downey Jr.     0.005692
Tom Cruise            0.005692
Tom Hanks             0.005692
Scarlett Johansson    0.005489
Name: movie_title, dtype: float64

7. Create a table that contains the first and last years each actor appeared, and their length of history. Then include columns for the actors proportion and total number of movies.

  • length is number of years they have appeared in movies

In [353]:
actor_years = movies.groupby(['actor_1_name'])['title_year'].aggregate({'min_year_1': np.min, 'max_year_1': np.max})
actor_years = actor_years.add(movies.groupby(['actor_2_name'])['title_year'].aggregate({'min_year_2': np.min, 'max_year_2': np.max}), fill_value=0)
actor_years = actor_years.add(movies.groupby(['actor_3_name'])['title_year'].aggregate({'min_year_3': np.min, 'max_year_3': np.max}), fill_value=0)

actor_years['first_year'] = np.min(actor_years[['min_year_1', 'min_year_2', 'min_year_3']], axis=1)
actor_years['last_year'] = np.max(actor_years[['max_year_1', 'max_year_2', 'max_year_3']], axis=1)

actor_years.drop(['min_year_1', 'min_year_2', 'min_year_3', 'max_year_1', 'max_year_2', 'max_year_3'], axis=1, inplace=True)

actor_years['history_length'] = actor_years['last_year'] - actor_years['first_year']

actor_years['movie_number'] = actors
actor_years['movie_proportion'] = actors / len(movies)

actor_years


Out[353]:
first_year last_year history_length movie_number movie_proportion
50 Cent 2005.0 2015.0 10.0 5.0 0.001016
A. Michael Baldwin 1988.0 1988.0 0.0 1.0 0.000203
A.J. Buckley 2000.0 2015.0 15.0 5.0 0.001016
A.J. DeLucia 2015.0 2015.0 0.0 1.0 0.000203
A.J. Langer 1998.0 1998.0 0.0 1.0 0.000203
AJ Michalka 2009.0 2011.0 2.0 2.0 0.000407
Aaliyah 2000.0 2002.0 2.0 2.0 0.000407
Aaron Ashmore 2004.0 2015.0 11.0 2.0 0.000407
Aaron Hill NaN NaN NaN 1.0 0.000203
Aaron Hughes 2007.0 2007.0 0.0 1.0 0.000203
Aaron Kwok 2016.0 2016.0 0.0 1.0 0.000203
Aaron Stanford 2000.0 2003.0 3.0 4.0 0.000813
Aaron Staton 2007.0 2007.0 0.0 1.0 0.000203
Aaron Yoo 2007.0 2015.0 8.0 5.0 0.001016
Aasheekaa Bathija 2004.0 2004.0 0.0 1.0 0.000203
Aasif Mandvi 2008.0 2012.0 4.0 4.0 0.000813
Abbie Cornish 2007.0 2014.0 7.0 7.0 0.001423
Abby Elliott 2011.0 2011.0 0.0 1.0 0.000203
Abby Mukiibi Nkaaga 2006.0 2006.0 0.0 1.0 0.000203
Abhishek Bachchan 2005.0 2011.0 6.0 3.0 0.000610
Abigail Evans 2014.0 2014.0 0.0 1.0 0.000203
Abigail Spencer 2012.0 2014.0 2.0 4.0 0.000813
Abraham Benrubi 1997.0 2016.0 19.0 3.0 0.000610
Ace Marrero 2013.0 2013.0 0.0 1.0 0.000203
Adam Alexi-Malle 1999.0 1999.0 0.0 1.0 0.000203
Adam Arkin 1981.0 2012.0 31.0 4.0 0.000813
Adam Baldwin 1980.0 2005.0 25.0 6.0 0.001220
Adam Boyer 2006.0 2006.0 0.0 1.0 0.000203
Adam Brown 2012.0 2014.0 2.0 3.0 0.000610
Adam Butcher 2004.0 2004.0 0.0 1.0 0.000203
... ... ... ... ... ...
Zakes Mokae 1981.0 1995.0 14.0 2.0 0.000407
Zan Marolt 2007.0 2007.0 0.0 1.0 0.000203
Zane Holtz 2003.0 2003.0 0.0 1.0 0.000203
Zayn Malik 2013.0 2013.0 0.0 1.0 0.000203
Zelda Rubinstein 1982.0 1988.0 6.0 2.0 0.000407
Zen McGrath 2014.0 2014.0 0.0 1.0 0.000203
Zena Grey 2010.0 2010.0 0.0 1.0 0.000203
Zero Mostel 1966.0 1966.0 0.0 1.0 0.000203
Zhanna Friske 2004.0 2004.0 0.0 1.0 0.000203
Zheng Xu 2006.0 2009.0 3.0 2.0 0.000407
Zhengyong Zhang 2004.0 2004.0 0.0 1.0 0.000203
Zhonghua Chen 2006.0 2006.0 0.0 1.0 0.000203
Zinedine Soualem 1996.0 2010.0 14.0 2.0 0.000407
Zoe Kazan 2008.0 2015.0 7.0 5.0 0.001016
Zoe Levin 2010.0 2010.0 0.0 1.0 0.000203
Zoe Lister-Jones 2009.0 2009.0 0.0 1.0 0.000203
Zoe Ventoura 2014.0 2014.0 0.0 1.0 0.000203
Zoey Deutch 2013.0 2016.0 3.0 2.0 0.000407
Zohra Segal 2005.0 2005.0 0.0 1.0 0.000203
Zooey Deschanel 1999.0 2011.0 12.0 19.0 0.003863
Zoë Bell 2007.0 2015.0 8.0 3.0 0.000610
Zoë Kravitz 2007.0 2016.0 9.0 8.0 0.001626
Zoë Poledouris 2001.0 2001.0 0.0 1.0 0.000203
Zubaida Sahar 2003.0 2003.0 0.0 1.0 0.000203
Zuhair Haddad 2014.0 2014.0 0.0 1.0 0.000203
Álex Angulo 2014.0 2014.0 0.0 1.0 0.000203
Ángela Molina 2014.0 2014.0 0.0 1.0 0.000203
Émilie Dequenne 2009.0 2009.0 0.0 1.0 0.000203
Ólafur Darri Ólafsson NaN NaN NaN 1.0 0.000203
Óscar Jaenada 2010.0 2016.0 6.0 3.0 0.000610

6255 rows × 5 columns

8. Create a column that gives each movie an integer ranking based on gross sales

  • 1 should indicate the highest gross
  • If more than one movie has equal sales, assign all the lowest rank in the group
  • The next rank after this group should increase only by 1

In [372]:
movies['gross_sales_rank'] = movies['gross'].rank(method='dense', ascending=False, na_option='bottom')
movies[['movie_title', 'gross', 'gross_sales_rank']].sort_values(by='gross_sales_rank').head(20)


Out[372]:
movie_title gross gross_sales_rank
0 Avatar 760505847.0 1.0
26 Titanic 658672302.0 2.0
29 Jurassic World 652177271.0 3.0
17 The Avengers 623279547.0 4.0
66 The Dark Knight 533316061.0 5.0
240 Star Wars: Episode I - The Phantom Menace 474544677.0 6.0
3024 Star Wars: Episode IV - A New Hope 460935665.0 7.0
8 Avengers: Age of Ultron 458991599.0 8.0
3 The Dark Knight Rises 448130642.0 9.0
582 Shrek 2 436471036.0 10.0
3080 E.T. the Extra-Terrestrial 434949459.0 11.0
186 The Hunger Games: Catching Fire 424645577.0 12.0
13 Pirates of the Caribbean: Dead Man's Chest 423032628.0 13.0
509 The Lion King 422783777.0 14.0
43 Toy Story 3 414984497.0 15.0
32 Iron Man 3 408992272.0 16.0
439 The Hunger Games 407999255.0 17.0
27 Captain America: Civil War 407197282.0 18.0
161 Spider-Man 403706375.0 19.0
36 Transformers: Revenge of the Fallen 402076689.0 20.0